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Abstract 

Recently, the theory of diffusion maps was extended to a large class of local kernels with exponential decay which 
were shown to represent various Riemannian geometries on a data set sampled from a manifold embedded in Euclidean 
space. Moreover, local kernels were used to represent a diffeomorphism between a data set and a feature of interest 
using an anisotropic kernel function, defined by a covariance matrix based on the local derivatives DIH. In this paper, 
we generalize the theory of local kernels to represent degenerate mappings where the intrinsic dimension of the data 
set is higher than the intrinsic dimension of the feature space. First, we present a rigorous method with asymptotic 
error bounds for estimating from the training data set and feature values. We then derive scaling laws for the 
singular values of the local linear structure of the data, which allows the identification the tangent space and improved 
estimation of the intrinsic dimension of the manifold and the bandwidth parameter of the diffusion maps algorithm. 
Using these numerical tools, our approach to feature identification is to iterate the diffusion map with appropriately 
chosen local kernels that emphasize the features of interest. We interpret the iterated diffusion map (IDM) as a discrete 
approximation to an intrinsic geometric flow which smoothly changes the geometry of the data space to emphasize 
the feature of interest. When the data lies on a manifold which is a product of the feature manifold with an irrelevant 
manifold, we show that the IDM converges to the quotient manifold which is isometric to the feature manifold, thereby 
eliminating the irrelevant dimensions. We will also demonstrate empirically that if we apply the IDM to features which 
are not a quotient of the data manifold, the algorithm identifies an intrinsically lower-dimensional set embedding of 
the data which better represents the features. 
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1. Introduction 

Often, for high-dimensional data and especially for data lying on a nonlinear subspace of Euclidean space, the 
variables of interest do not lie in the directions of largest variance and this makes them difficult to identify. In this 
paper we consider the supervised learning problem, where we have a training data set, along with the values of the 
features of interest for this training data. Throughout this manuscript we will assume that the training data set consists 
of data points which are near a manifold M embedded in an m-dimensional Euclidean space; we refer to M as the 
‘data space’ or ‘data manifold’. We also assume that we have a set of feature values corresponding to each training 
data point, and these feature values are assumed to lie near a manifold N embedded in an ^-dimensional Euclidean 
space; we refer to N as the ‘feature space’ or ‘feature manifold’. We assume that the feature manifold is intrinsically 
lower-dimensional than the data manifold, so intuitively the data manifold contains information which is irrelevant 
to the feature, and we refer to this information broadly as the ‘irrelevant variables’ or the ‘irrelevant space’. In some 
contexts we will be able to identify the irrelevant space explicitly, for example the data manifold may simply be a 
product manifold of the feature manifold with an irrelevant manifold. However, more complex relationships between 
the data manifold, feature manifold, and irrelevant variables are possible. We will think of the feature space as arising 
from a function defined on the data space, and our goal is to represent this function. However, this will only be possible 
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in some contexts such as the product manifold described above. More generally, our goal is to find a mapping from 
the data space to an intrinsically lower-dimensional space which contains all the information of the feature space but 
is, as far as possible, independent of the irrelevant variables. 

Recently a method for formally representing a diffeomorphism between manifolds using discrete data sets sampled 
from the manifolds was introduced in [3]. In particular, a diffeomorphism is represented using a local kernel to pull 
hack the Riemannian metric from one manifold onto the other. With respect to the intrinsic geometry of the local 
kernel, the manifolds are isometric, and the isometry can be represented by a linear map between the eigenfunctions 
of the respective Laplacian operators. In this paper, we consider the more difficult case when the manifolds are not 
diffeomorphic, so that one manifold may even be higher dimensional than the other. This represents the scenario 
described above, where some of the variables of a high-dimensional data set may be irrelevant to the features of 
interest. 

The challenge of having irrelevant variables is that it violates the fundamental assumption of differential geometry, 
namely that it is local. This is because data points which differ only in the irrelevant variables will be far away in 
the data space and yet have the same feature values. This fundamental issue is independent of the amount of data 
available and is illustrated in Figure 1. Namely, if the feature of interest is the radius of an annulus, then points on 
opposite sides of the annulus are closely related with respect to this feature of interest. Conversely, points which are 
far away in the feature space may appear relatively close in data space; this can occur when many of the irrelevant 
variables are very close. Of course, in the limit of large data, points being close in data space implies that they are 
close in feature values. However, a large number of irrelevant variables can easily overwhelm any finite data set due 
to the curse-of-dimensionality. Intuitively, the presence of irrelevant variables makes it difficult to determine the true 
neighbors. 



Figure 1: Top: Original data set colored according to the desired feature (leftmost) followed by four iterations of the diffusion map using a local 
kernel defined in Section 4. Bottom: Original data set showing the 200 nearest neighbors (red) of the blue data point, where the neighbors are 
found in the corresponding iterated diffusion map space. Notice that as the iterated diffusion map biases the geometry towards the desired feature 
(the radius), the neighbors evolve towards the true neighbors with respect to the desired feature. 


Intuitively our goal is to determine the true neighbors of every point in the data space, meaning the points in the 
data space which have similar feature values regardless of the values of the irrelevant variables. The key difficulty 
is that we need a method which can be extended to new data points, since the goal of representing the map 7T is to 
be able to apply this map to new points in data space. In order to learn the map 7T we will assume that we have a 
training data set for which the feature values are known. Of course, for the training data set, we could easily find 
the true neighbors of the training data points by using the known feature values. However, finding the neighbors 
using the feature values cannot be used for determining the true neighbors of a new data point, since the goal is to 
determine the feature values of the new data point. Instead, we propose an iterative mapping which smoothly distorts 
the data space in a way that contracts in the directions of the irrelevant variables and expands in the directions of the 
features. Crucially, this iterative mapping can be smoothly extended to new data points for which the feature values 
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are unknown, allowing us to extend the feature map to these new data points. 

In this paper, we introduce the Iterated Diffusion Map (IDM) which is an iterative method that smoothly distorts 
the data space in a way that contracts in the directions of the irrelevant variables and expands in the directions of the 
features. We illustrate our method for the purposes of intuitive explanation in Figure 1. In this example, the original 
data set is an annulus in the plane, but the variable of importance (represented by color in the top images) is the radial 
component of the annulus, meaning that the angular component is an irrelevant dimension of the manifold. The initial 
neighborhood is simply an Euclidean ball in the plane as shown in the bottom row of images. In the subsequent 
images we apply the diffusion map multiple times to evolve the data set in a way that biases it towards the desired 
feature. Notice that as the geometry evolves, the notion of neighborhood evolves. In the bottom right image, we see 
that after four iterations of the diffusion map, the notion of neighbor has grown to include any points which have 
the same radius, independent of their angle. Moreover, after four iterations, we see that points that were initially 
very close neighbors, namely points that have the same angle but slightly different radii, are no longer neighbors. So 
after applying the IDM, the notion of neighbor becomes very sensitive to the feature (radius) and independent of the 
irrelevant variable (angle). 

As we will explain in Section 4.2, the example in Figure 1 has a particularly nice structure, namely the full data set 
is a product space of the desired feature with the irrelevant variables. When this structure is present we will be able to 
interpret the iterated diffusion map as a discretization of a geometric evolution which contracts the irrelevant variables 
to zero, thereby reconstructing the quotient map. When the product space structure is not present, the iterated diffusion 
map recovers a more complex structure which is not yet theoretically understood. In Section 4.3 we will give several 
simple examples of both product spaces and non-product spaces that illustrate the current theory and its limitations. 
Naturally, if one wishes to understand every feature of a data set, there is no advantage to the iterated diffusion map. 
However, often we can identify desirable features in training data sets, and the IDM finds an extendable map to an 
intrinsically lower dimensional space which better represents the desired features. 

As we will see below, the construction of IDM requires several tools. In Section 2.1, we will show that iterating the 
standard diffusion map of [4] has no effect (after the first application of the diffusion map, subsequent applications will 
approximate the identity map when appropriately scaled). We will see that the isotropic kernel used in the standard 
diffusion map yields a canonical isometric embedding of the manifold. In Section 2.2, we review how local kernels, 
can change the geometry of the manifold and obtain an isometric embedding with respect to the new geometry. 
Local kernels are a broad generalization of the isotropic kernels used in [4] and were shown in [3] to be capable of 
representing any geometry on a manifold. 

To construct a local kernel that emphasizes the feature directions, we will need to estimate the derivative of the 
feature map, D7T. In Section 3, we give a rigorous method of estimating D7T, including asymptotic error bounds, 
based on a weighted local linear regression. Moreover, in Section 3.1, by applying this weighted local linear regression 
from the manifold to itself we derive scaling laws for the singular values of the local linear structure near a point on 
the manifold described by the data. The scaling laws of the singular values yield a robust method of identifying 
the tangent space of the manifold near a point. Finally, these scaling laws allow us to devise more robust criteria 
of determining the intrinsic dimension of manifold, as well as the local bandwidth parameter of the diffusion maps 
algorithm. 

With the tools of Sections 2 and 3, our goal is to use D7T to construct a new geometry on the data set that 
emphasizes the feature of interest. In Section 4 we will see that naively forming an anisotropic kernel (following [3]) 
using a rank deficient matrix will not satisfy the assumptions that define a local kernel. So, in order to represent 
the feature map 7T, we cannot directly apply the local kernels theory of [3], which only applies to diffeomorphisms. 
Instead, in Section 4 we introduce the IDM as a discrete approximation of a geometric flow that contracts the irrelevant 
variables and expands the feature variables on the manifold. In Section 4.2 we show that when the data manifold is the 
product of the feature manifold with irrelevant variables, this geometric flow will recover the quotient map from the 
data manifold to the feature manifold. In Section 4.3 we give several numerical examples demonstrating the IDM and 
we also include the IDM numerical algorithm in Appendix A. We close the paper with a short summary, highlighting 
the advantages and limitations. 
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2. Background 

In this section, we review recent key results that are relevant to the method developed in this paper. First, we 
remind the readers that, up to a scalar factor, a diffusion map is an isometric embedding of the manifold represented 
by a data set. Second, we briefly review the recently developed method for representing diffeomorphism between 
manifolds [3], which we will use as a building block. 

2.7. The Diffusion Map as an Isometric Embedding 

A natural distance that respects the nonlinear structure of the data is the geodesic distance, which can be approx¬ 
imated as the shortest path distance. However, the shortest path distance is very sensitive to small perturbations of a 
data set. A more robust metric that also respects the nonlinear structure of the data is the diffusion distance which is 
defined as an average over all paths. This metric can be approximated by the Euclidean distance of the data points in 
the embedded space constructed by the diffusion maps algorithm [4]. 

For a Riemannian manifold M with associated heat kernel k{t, x, y), we can define the diffusion distance as, 

Dt{x, yf = \\k{t, X, ■) - k{t, y, •)ll£ 2 (;vi) = f (k(t, x, u) - k(t, y, u)fdV{u), 

Jm 

for x,y e A4 where dV is the volume form on A4 associated to the Riemannian metric which corresponds to k. We 
can write the heat kernel as k(t, x, y) = (y) where is the Dirac delta function and A is the (negative definite) 

Laplace-Beltrami operator on At with eigenfunctions A(fi = where 0 = Aq > Ai > A 2 > .... Using the Plancherel 
equality the diffusion distance becomes, 

00 00 

D,{x,yf = \\e‘^5:, - ^ e^‘^‘{>pi{x) - >Piiy)f, 

i=l i=l 

where the term / = 0 is zero since ^0 is constant. Defining the diffusion map by, 

for M sufficiently large, the diffusion distance is well approximated by the Euclidean distance in the diffusion coordi¬ 
nates, Dt{x,y) ^ ||0^(x) - 0^(y)||. The key to making this idea practical is the algorithm of [4] which uses the data set 
{xi} sampled from a manifold A1 c M'” to construct a sparse graph Laplacian L that approximates the Laplace-Beltrami 
operator, A on At. 

Of course, the dimension M of the diffusion coordinates will depend on the parameter t, which is intuitively a kind 
of coarsening parameter. For small t we have, 

{e‘^6,,6y) = i47Ttr‘‘'^e-‘‘^^^’y'>"^^*‘\uoix,y) + 0(1)), 

where dg(x,y) is the geodesic distance and d is the intrinsic dimension of At (see for example [7]). The function 
uo(x,y) is the first term in the heat kernel expansion. In [7] the following formula is derived for uo(x,y), 

uo(x,y) = | 4 exp;‘)(y)y 2 ^ + 0(dg(x,yf)\''^ = l+0(dg(x,yf) 

where exp_^ is the exponential map based at x, and the expansion follows from noting that exp_^ is a smooth map with 
first derivative equal to the identity at x and second derivative orthogonal to the tangent plane. Using the expansion of 
uo(x,y), for dg(x,y) sufficiently small, we have the following expansion of the heat kernel, 

{e‘^5^, 5y) = + 0{t, dg(x, yf)), (1) 

and below we will bound the error by the worst case of the intrinsic dimension, namely d = 1. Using the heat kernel 
expansion (1), we can expand the diffusion distance as, 

DAx, yf = df) + {e^‘%, 6y) - 2 {e^‘%, Sy) = (STrr)-^/^ ^2 _ y))) 

= - dg{x,yf /{St) + 0(dg{x,yf+ 0(t,dg(x,y))) 

= {SM)-‘‘^H4t)-^dg{x,y)Hl+0{dg(x,y)^/mi+0{t,dg(x,y))) 
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Based on (2) we define the rescaled diffusion map by, 

^,(x) = 

and the rescaled diffusion distance, D({x,y) = {2nY^*{AtY^‘^^^I^D,{x,y), which is approximated by, 

D,{x,yf K 110),(x) - OtCy)!!^ « {2nYI'^{AtYI'^^^D,{x,yf = dg{x,yf + 0{tdg{x,yf ,dg{x,yf ,dg{x,yf!t), 

so that for dg{x, j)^ « r « 1 the rescaled diffusion distance approximates the geodesic distance. 

Notice that the rescaled diffusion distance only approximates the geodesic distance when the geodesic distance 
is small. In particular, the diffusion distance should not be thought of as an approximate geodesic distance, as is 
clearly shown in Figure 2 below. For small distances, the rescaled diffusion distance and the geodesic distance 
also agree very closely with the Euclidean distance in any isometric embedding, as was shown in [4]. In fact, the 
diffusion map provides a canonical embedding of the manifold At up to a rotation in the following sense: For any 
isometric image TV = i{M) of M where l is an isometry, the diffusion map embeddings of TV and M with the same 
parameter t will differ by at most an orthogonal linear map. This is because the eigenfunctions of the Laplacian 
depend only on the geometry of the manifold, which is preserved by an isometric map, and for the eigenfunctions 
corresponding to repeated eigenvalues may differ only by an orthogonal transfomation. Moreover, the fact that the 
rescaled diffusion map preserves small geodesic distances implies that the rescaled diffusion map is approximately 
an isometric embedding (this was shown previously in [6] which provides more detailed bounds). In particular, this 
implies that, for t small and M large, the Laplace-Beltrami operator on O^(AI) is very close to the Laplace-Beltrami 
operator on M. One consequence of this fact is that if we iterate the rescaled diffusion map, the results should not 
change (up to a rotation). 



— Euclidean Distance 
Geodesic Distance 

- Rescaled Diffusion Distance, t=1e-4 

- Rescaied Diffusion Distance, t=1e-3 

Rescaied Diffusion Distance, t=1 e-2 


10 '® 10 '^ 10 ° 
Euclidean Distance 




Figure 2: For 2000 data points equally spaced on a unit circle in we compare the Euclidean distance, geodesic distance, and rescaled diffusion 
distances for t e {10“^, 10“^, 10“^} (left). We also show the spectra of the heat kernel for the corresponding values of t (middle) and the results 
of iterating the standard diffusion map compared to the original diffusion map eigenfunctions (right). 


To demonstrate these facts numerically, we generated TV = 2000 points equally spaced on a unit circle in 

M?. We applied the diffusion maps algorithm to estimate the eigenvalues At and eigenfunctions (fi(xj) of the Laplace- 
Beltrami operator on the unit circle. We should emphasize that it is crucial to correctly normalize the eigenfunctions 
(Pi using a kernel density estimate q(xj), that is, we require. 


1 _ r 

q(^j) ^ Jm‘ 


(fi(xfdV(x), 


where dV is the volume form on M inherited from the ambient space. See [2] for details on the Monte-Carlo integral 
above and a natural density estimate implicit to the diffusion maps construction. We then evaluated the rescaled 
diffusion map O^(Xy) for t e 10“^, 10“^} and compared the resulting diffusion distances bt{xi,Xj) ^ ||Or(v/) - 

0^(xy)|| to the Euclidean distances \\xi - Xj\\ and the geodesic distances dg(xi, xj) in Figure 2. Notice that for distances 
less than all the distances agree as shown above. We also show the spectra of the heat kernels, which are the 
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weights of the various eigenfunctions in the diffusion map embedding. Notice that for t large, the spectrum decays 
much faster, so fewer eigenfunctions are required for the diffusion distance to be well approximated by the Euclidean 
distance in the diffusion mapped coordinates. Finally, for t = 10“^, we performed an ‘iterated’ diffusion map, by 
computing the (rescaled) diffusion map of the data set xj = Or(Vy), in effect finding 0^(0^(x)). We then compared 
the eigenfunctions ipi{xj) from the first diffusion map with those (pi{xj) = ^/(0^(x)). Due to the symmetry of the 
unit circle, the eigenfunctions corresponding to repeated eigenvalues differed by an orthogonal linear map (meaning 
a phase shift in this case). After removing the phase shift, the eigenfunctions are compared in Figure 2. 

Since the diffusion map differs from the rescaled diffusion map by a scalar factor, the eigenfunction from 
iterating the standard diffusion map will also agree. The only purpose of the rescaled diffusion map is to exactly 
recover the local distances in the data set, and thereby to also find the same eigenvalues (since rescaling the manifold 
will change the spectrum of the Faplacian). Finally, if the standard diffusion map is used, the nuisance parameter e 
will have to be retuned in order to iterate the diffusion map, since the diffusion distances will be scaled differently 
than the original distances. We emphasize that the goal of this section is to show that iterating the standard diffusion 
map algorithm is not a useful method. However, in [3] it was shown that a generalization of diffusion maps to local 
kernels can be used to construct the Faplace-Beltrami operator with respect to a different metric. In the remainder of 
the paper we will see that when the new metric is induced by a feature map on the data set, iterating the diffusion map 
has a nontrivial effect which can be beneficial. 


2.2. Local Kernels and the Pullback Geometry 

The connection between kernel functions and geometry was introduced by Belkin and Niyogi in [1] and gener¬ 
alized by Coifman and Fafon in [4]. Assuming that a data set {v/} is sampled from a density p{x) supported on a 
J-dimensional manifold A1 c M'”, summing a function 2/ f{^i) approximates a the integral f(y)p(y) dV(y) where 
y(y) is the volume form on At inherited from the ambient space The central insight of [1, 4] is that by choosing 
a kernel function K(xi, xj) which has exponential decay, the integral K(xi, Xj)f(Xj) ^ K(xi,y)f{y)p{y) dV(y) is 
localized to the tangent space Tx^M of the manifold. 

The theory of [1, 4] was recently generalized in [3] to a wide class of kernels called local kernels which are 
assumed only to have decay that can be bounded above by an exponentially decaying function of distance. The results 
of [3] generalized an early result of [10] to a much wider class of kernels and connected these early results to their 
natural geometric interpretations. In this paper we will use the following prototypical example of a local kernel, 
since it was shown in [3] that every operator which can be obtained with a local kernel can also be obtained with 
a prototypical kernel. Fet C(x) be a matrix valued function on the manifold At such that each C(x) is a symmetric 
positive definite mxm matrix. Define the prototypical kernel with covariance C (and first moment of zero) by 


K(€, X, y) = exp 


(x-yfC(xr\x-y) \ 

26 }■ 


(3) 


The theory of local kernels [3] uses a method closely related to the method of Diffusion Maps of [4] to construct 
matrices and L* which are discrete approximations to the following operators, 

■Cf = \cijViVjf rf = Iv,V,(c,v/), (4) 

in the sense that in the limit of large data and as 6 ^ 0 we have ^ £ and L* ^ X*. Notice that the matrix 
valued function C(x) acts on the ambient space, whereas the tensor c(x) in the limiting operator X is only defined on 
the tangent planes of At. As shown in [3], only the projection of C(x) onto the tangent space TxM will influence the 
operator X. Thus, we introduce the linear map I{x) : TxM which acts as the identity on the tangent plane as 

a subspace of and sends all vectors originating at x which are orthogonal to TxM to zero. The map I projects the 
ambient space onto the tangent space so that J(x) is a J x m matrix and we define c(x) = J(x)C(x)J(x)^. 

Given data sampled from a J-dimensional manifold M embedded in Euclidean space the manifold M naturally 
inherits a Riemannian metric, from the ambient space. The standard Diffusion Maps algorithm uses an isotropic 
kernel (where the covariance matrix is a multiple of the identity matrix) to estimate the Faplace-Beltrami operator 
corresponding to the metric g^. It was shown in [3] that local kernels such as (3) can be used to approximate the 
Faplace-Beltrami operator corresponding to a new Riemannian metric g = where gw is a Riemannian 
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metric of TV = 9i(Ai) for diffeomorphism that satisfies c ^ Formally, we summarize this result as 

follows: 


Theorem 2.1 (Pullback geometry of local kernels, with nonuniform sampling). Let (At, gM) ^ Riemannian man¬ 
ifold and let ^ M be sampled according to any smooth density on At. Let Li \ M ^ N be a diffeomorphism 

and let yi = H{xi) and c(xi)~^ = DLiiXiYDLi{Xi). For the local kernel K in (3), define the symmetric kernel 
K(€, X, y) = K(€, X, y) + K(€, y, x). Then for any smooth function f on At, 


lim - 

N^oo £ 


/ N 

y^K(e, Xi, Xj)fiXj)IZi K(e, Xj, xi) 


N 

fK(e, Xi, Xj)IY,i K(e, xj, xi) 
j=i 


\ 


-fiXi) 


= AgfiXi) + 0(€) = Ag^if o + 0(e) 


where g(u, v) = gj^(DL-(u, DLiv). 

Theorem 2.1 follows directly from Theorem 4.7 of [3]. This result was used by [3] to represent a diffeomorphism 
between two manifolds. We assume we are given a training data set x/ e At c sampled from the data manifold 
At along with the true feature values, yt = Ll{xi), where y/ lie on TV = L((M). When 7T is a diffeomorphism, we 
can use a local kernel to pullback the Riemannian metric from TV onto At via the correspondence between the data 
sets. With this metric on At, the two manifolds are isometric, which implies that the Laplacians (Ag on At and on 
TV) have the same eigenvalues, and that the associated eigenfunctions of any eigenvalue are related by an orthogonal 
transformation [7]. 

In Section 3 we will give a rigorous method to approximate c(xi)~^ = DLiixiYDLi^Xi) using the training data. 
With this approximation, numerically we evaluate the local kernel 


K{e, Xi, xf) = exp 


\\iy]-i(xi)(xj - Xi)\\^\ 

2e ]■ 


(5) 


By Theorem 2.1, using the kernel (5), we approximate the Laplacian Ag = on M. Simultaneously, using 

the standard diffusion maps algorithm (with a = 1) we approximate the Laplacian on TV. Since {M,g) and 
(TV, gw) are isometric, the eigenvalues of Ag and A^^ will be the same and the corresponding eigenfunctions will 
be related by an orthogonal transformation. By taking sufficiently many eigenfunctions ipi and (fi on the respective 
manifolds, the eigenfunctions can be considered coordinates of an embeddings 0(x) = (^i(v),..., (fuiYiY and 0(y) = 
(^i(y),..., We can now project the diffeomorphism Li into these coordinates as. 


M —TV 


o 




C(M, g) « M" —-—> C(JV, gjv) ~ M" 

where ff = OoL{oO~^ is linear and can be estimated using linear least squares. Finally, to extend the diffeomorphism 
to new data points x e Mwe need only extend the map O to this new data point using standard methods such as the 
Nystrom extension. 

Notice that the key to the existence of the linear map ff is that the diffeomorphism Li induces a new metric on At 
that is isometric to the metric on TV. In Section 4 we will make use of this theorem for identifying feature in At that 
is relevant to the data in TV, even when Lf is not a diffeomorphism, but simply a mapping. However, we will first give 
rigorous results in Section 3 for approximating the tangent plane and the derivative DLf from data. 
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3. Tangent Spaces and Derivatives 

Section 2.2 shows that to build a global map : M ^ N = 9i{M) between data sets, we need to estimate the 
local linear maps D^(xi) between the tangent spaces and at each point x/ g M. Notice that D^(x) is 

di dj^ X d matrix, where d is the intrinsic dimension of At and Jw is the intrinsic dimension of TV. However, it will 
be more natural to represent as a map between the ambient spaces d T^M and W d Recall that 

J(x), is a J X m matrix valued function which projects from the ambient space onto the tangent space T^M c 
such that I(x)I(xy = Idxd- We introduce the notation 1 7vCH(x)) for the dj^ xn matrix valued function given by the 
projection from W onto the tangent space T^(x)N c W such that Ij^(^(x))1 7vCH(x))^ = Id^xd^- With this notation, 

D^{x) = Iu(^(x)yD9{(x)I(x). ( 6 ) 


In practice we will estimate DiH{x) e however, when used to construct a local kernel as in Section 2.2 only 

will influence the intrinsic geometry defined by the kernel. 

In this section we improve and make rigorous a method originally introduced in [3] that estimates the local linear 
maps from data using a weighted regression. To estimate Dii(xi), we take the nearest neighbors {xj} of x/ and use 
the correspondence to find yi = 7T(x/) and the neighbors yj = ^(xj). Note that yj may not be the nearest neighbors 
of yt due to the distortion of the geometry introduced by 7T; although if 7T is a diffeomorphism (as in [3]) the local 
distortion will be very small. In [3] they construct the weighted vectors 

dxj = exp - x,|p/(4e)) (xj - x,) dyj = exp {-\\xj - x,|p/(4e)) (>>; - yi), 

and define Dik{xi) to be the matrix which minimizes Yjj Myj - D^(xi)dxj\\^. Intuitively, the exponential weight is 
used to localize the vectors; otherwise the linear least squares problem would try to preserve the longest vectors xj-xt, 
which do not represent the tangent space well. This method of localization was used in [3] for estimating Dii(xi), 
and it is also closely related to a method of determining the tangent space of a manifold which was introduced in 
[9]. Using the foundational theory developed in [4] we will now make this method of finding tangent spaces and 
derivatives rigorous. 

Theorem 3.1. Let x/ be samples from M c with density p{x) and yi = 7T(x/) where 7T : At ^ TV c W. Define 
X to be a matrix with columns Xj = D(x)~^^^ exp j(Xy - x) = D(x)~^^^dxj and let Y be a matrix with columns 

Yj = Z)(x)“^/^ exp j (yj -y) = D(x)~^^^dyj, where 



Then, 


lim -YX^ = EYHix) + e7?«(x) + 0(e\ (7) 

v^oo e 

with Dik(x) as in (6) and Rji{x) e 

Proof. Following Appendix B of [4], let x,y g At with ||y - x|| < ^fe with e sufficiently small so that there is a 
unique geodesic y \ [Q, s] ^ M with 7(0) = x and = y. Let {et} be a basis for the tangent space TxM and 
define the projection of the geodesic onto the tangent plane by ut = (y - x, ei) = (y(s) - y(0), et). Locally, we can 
parameterize the manifold using a function q : TxM TxM^ so that y - x = (u, q(u)). We now use the Taylor 
expansion y(s) = 7(0) + 5 '7'(0) + s^y"(0)/2 + O(s^), where 7'(0) G TxM and 7"(0) is orthogonal to the tangent space. 
Combining the previous lines yields, 

(m, q{u)) = y-x = y(s) - y(0) = sy'(O) + ^7"(0)/2 + 


8 





which implies that u = 5 - 7 '(0) + 0(e^^^) and q(u) = s^y"(0)l2 + (9(6^/^). From Equation (B.2) in [4], we have 
II 7 - x|p = \\u\\^ + 0(€^). For V G TxM and w g we have, 

(7 - V, v) = ^ <r'( 0 ), v) + 0{e^'^) {y - x,w) = jl <r"( 0 ), w) + 0{e^'^) 

This shows that taking the inner product with vectors 7 - v in the ^[e neighborhood of x, vectors in the tangent space 
are of order- ^[e and vectors in the orthogonal complement are of order- 6 . 

Let {x/} be discrete data points sampled from M. Recall from [4] we have, 


1 1 ^ 
lim —D(x) = lim — > exp 

N^oo N N^oo N ^ 


N^oo N 4 

I expl-^ 


\\Xi - x\f 
2e 


= / 

Ja 


^exp^- 2 e— 


2 e 


p(x)(l + 0(e)) du = (Inef’^pix) + 


( 8 ) 


where the continuous integral is a result of taking Monte-Carlo limit over data sampled from the sampling density 
piy) with respect to the volume form dV that M inherits from the ambient space. The restriction of the integral to the 
tangent plane T^M was shown in [4] and follows from the exponential decay of the integrand and we also use the fact 
from [4] that dViy) = (1 + 0{e))du. Finally, the change of variables in ( 8 ) drops all the odd order terms due to the 
symmetry of the kernel. 

Recall that X was the matrix with columns Xj = D(x)~^^^ exp j (xj - x) = D(x)~^^^dxj and Y is the matrix 

with columns Yj = D(x)~^^^ exp](yy -y) = D(x)~^^^dyj. For any vectors v g M'” and w G M" we have. 


^ / llx • — xlp \ 

Urn w"" YX'^v = Hm Z)(x)“' ^ I ~ ~ 

j=l \ / 


= lim 

N^oo 


D(x)\~' 1 


N N 2 L 

' ./=i 




\\Xj - x|p 
2 e 


iy(Xj) - ‘J-l(x), w) (^Xj - X, v) 


= (2ne)-‘‘/^p(x)-\l+0(e)) j^cxpl-^^^^y^]{'H(y)-mx),w}{y-x,v}p(y)dV{y) 


= (27T£) 


-dl2 


X 


M 


exp 


26 


26 

D9i{x)u -h -u^H(9{)(x)u 0(6^), ({u, v) -l- {q(u), v)) (1 -h 0(6)) du 

(9) 


where Ff(-) is the Hessian operator and the last equality follows from using the exponential decay of the integrand to 
restrict the integral to the tangent plane (see [4] for details). For w g and v g TxM we reduce (9) to. 


lim w^YX'^v = (27r6) 

N^oo 


1 


exp - 


-dl2 

TxM 


26 


^ D'H(x)ijUjWiUkVk du 0(6^) = £ iyHix)ijWiVj + 0(d) 


ij,k 


= €w^D9{(x)v a 0 ( 6 ^) 


( 10 ) 


On the other hand, for w G M" and v g we reduce (9) to. 


lim w'^YX^v = (Ine) 


-dl2 


N^oo 


I I 


exp 


26 


y [H(f{i)(x)]i jUjU jWiqk(u)vk( 1 + 0(e)) du 




= (27Te)‘‘^^ f Texp(-^^) V [H(^i)(x)hjUiUjUaUi,wi[H(qi:)(0)]abVkdu + O(d) 

Am 4 \ 2 e 

= dy^ VkWiR<H(x)ik + 0(d) = d\dR^(x)v + 0(d), 

k,l 


( 11 ) 
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where we have used the expansion qkiu) = H{qk)({))u and we define 

R'H{x)!k = Y}Hmi)ix)h[mqk)mjj + [ hck ,)( x )]„[ h (^,)( o )]„- + [Hmi){x)]ij[mqk)mj)- ( 12 ) 

ij 

Finally, it is easy to see that for w g and v e T^M all the terms will be polynomials of degree 3 

in the coordinates of u, and since these terms are all odd, by the symmetry of the domain of integration we have 
limAT^oo w'^YX^v = 0{e^). Together with (10) and (11), the proof is complete. □ 

We note that the above proof can easily be generalized on kernels of the form K{e,x,y) = j for h : 

[0, 00 ) ^ [0, 00 ) having exponential decay by following [4]. In the remainder of this section, we will discuss the 
consequences of this result in more details. In particular, we shall see that the scaling law established in this theorem 
provides systematic methods to identify tangent spaces, estimate derivavtive as well as to estimate the kernel 
bandwidth parameter 6, which is crucial for accurate numerical approximation. 

3.1. Identifying Tangent Spaces with the Singular Value Decomposition 

The first method of leveraging Theorem 3.1 is with the singular value decomposition (SVD). Intuitively, the 
singular vectors will naturally be sorted into tangent vectors, with singular values of order ^[e, and orthogonal vectors, 
with singular values of order e. To see this we state the following corollary to Theorem 3.1. 

Corollary 3.2. Let Xi be samples from M with density p{x). Define X to be a matrix with columns Xj = 

D{x)~^l^ exp j {Xj - x) = D(x)~^^^dxj, where D{x) is defined as in Theorem 3.1. Then, 

lim -XX^ = I(xfl(x) + eRrix) + 0(e\ (13) 

AC—>oo 5 


where Rfix) e 

Proof. The proof follows from Theorem 3.1 with Pi{x) = x so that DPi{x) = Idxd and DTl{x) = I(xyDPI(x)I(x) = 
I(xyi(x). Note that the Hessian HifiH) in the definition of R<ji in (12) is with respect to the coordinates u e T^M, so 
in general Rj is not necessarily zero. In fact, by repeating the argument in the derivation of (12) one can show that, 

Rii.x)ik = \il\x)y{y][H(qi)mu[H(qk)mjj + [Hiqi)mtj[Hiqkmij + [H(qi)mij[H(qk)mji)l\x), 

iJ 


where I^(x) : M'” ^ T^M^ is a projection operator that is identity in the directions orthogonal to T^M and maps all 
vectors originating at v to zero when they are in TxM. □ 

Recall that I{x) : TxM is the projection onto the tangent space at v viewed as a subspace of W^. Corollary 

3.2 suggests that if v g TxM, then limAr^oo v^XX'^v = e\\v\^+0{e^), whereas for v g TxM^ we find limAr^cx) v^XX^v = 
e^v^Ri{x)v + (9(6^) = (9(6^||v|P). This shows that if v is a singular vector, the associated singular value. 


(Tv = lim 

N^oo 


dv^XX^v 


will either be order- Vi if v is in the tangent space, or order-e if v is orthogonal to the tangent space. Since the singular 
value decomposition of X finds v which maximizes o-y, when e is well tuned the first d singular values will all be 
order- ^fe and the remaining m - d singular values will be order-6. This fact gives us a way to identify the tangent 
vectors of the manifold by defining the scaling law, Of/, of a singular value, (T/, to be the exponential power such that 
(T/ cx 6^^ When ai ^ 1/2 then the associated singular vector is a tangent vector and when ai > 1 then the associated 
singular vector is orthogonal to TxM. 
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• Base Point 



Data Set 
Base Point 
►1st Singular Vector 
►2nd Singular Vector 
►3rd Singular Vector 





Figure 3: Data set sampled from a Torus embedded in (top) and with noise added (bottom). Singular vectors are shown (left) that correspond 
to the optimal choice of € (shown above with the solid dot in the scaling law curves, see Section 3.3) based on the empirical scaling laws (right) for 
the various singular values and the determinant of the weighted vectors X at the base point (1.996,0.126,1.000)^. 


For discrete data, this power law will change as a function of the bandwidth parameter e. Numerically, we can 
estimate this power law by computing cr/(6) for discrete values 6/ and then approximating, 

^ dlogjcri) ^ l0g((T/(6/)) - l0g((T/(6/-i)) 

dlogie) log(6i) - log(6,_i) 

We now demonstrate this numerically by sampling 10000 points (Ot, (pi) e [0, 27i]^ from a uniform grid and mapping 
them onto a torus embedded in M? by (x,y,z)~^ = ((2 + cos(6)) cos(0), (2 + cos(6)) sin(0), sin(^))^. We chose a point 

X = (1.996,0.126,1.000)^ and constructed the weighed vectors Xj = exp j (xj - x) for 6/ = 

where /=!,..., 230. For each value of 6/ we compute the three singular values of Xj and then we compute the scaling 
laws for each singular value. These scaling laws are shown in Figure 3. We selected the optimal value of e using the 
method that we will describe in Section 3.3, which are highlighted by a solid dot in the scaling law curves, and we 
plot the associated singular vectors in Figure 3. 

To demonstrate the robustness of this methodology to small noise in the ambient space, we repeated the experiment 
adding a three dimensional Gaussian random perturbation with mean zero and variance 0.04/3x3 to each point. In the 
noisy case, the theoretical scaling laws are obtained for a much smaller range of values of e as shown in Figure 3. In 
fact, when analyzed at a small scale (e < 0.15) all three singular values have scaling law ai ^ 1/2, which represents 
the three dimensional nature of the manifold after the addition of the noise. However, the scaling laws also capture the 
approximate two-dimensional structure, as shown by the scaling law of the third singular vector being very close to 1 
for 0.22 < 6 < 0.4. This suggests that the scaling laws are robust for perturbations of magnitude less than 6, however. 
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the singular vectors are more sensitive as shown by the slight tilt in the tangent plane defined by the first two singular 
vectors in Figure 3. 

3.2. Estimating Derivatives with the Linear Regressions 

We now return to the problem of estimating the derivative of a nonlinear mapping Ei : A\ c: ^ TV c M" where 

we assume that we know the values of E( on our training data set yi = El{xi). As mentioned above, the approach of 
[3] was to use a linear regression to estimate D9i(xi) as the matrix which minimizes 2; Myj - Di3(xi)dxj\\^. Using 
the theory developed in Section 3.1 we can now rigorously justify this approach. Notice that the linear regression 
minimizes the error Y ^ D93(Xi)X by setting Di3(xi) = YX^(XX^)~^ (where the additional factor of D(x) from 
Theorem 3.1 cancels making this equivalent to the approach of [3]). 

Theorem 3.1 suggested that a simple method of estimating the derivative 093(x) is with the correlation matrix 
^YX~^. Numerically, we found that a better estimate of D93{x) is given by the linear regression YX'^(XX'^)~^, and 
we also analyze this construction. From Corollary 3.2 we have limAr^oo XX^ = el^xYI{x) + e^Ri{x) + 0(6^), which 
implies that in the limit of large data. 


lim (XZ"")-* = y{I{xyi{x)f - eRiix) + 0(e^)), 

N^oo e 

where t denotes the pseudo-inverse. Combining the results of Theorem 3.1 and Corollary 3.2 we have, 
lim YX^iXX^)-^ = {UHix) + eR^{x) + Oie^miixf I{x)f - eRi{x) + 0{e^)) 

N^oo 

= Dii(x)(I(xV I(x))^ + 0{e) 

= If^(9{{x)Vcm(x)I(x)(I(xVl(x))^ + 0(e) 

= Ifj(<H(x))^D^(x)(I(x)^)^ + 0(e) 

This implies that the regression based estimate of DEl^x) can have large errors in directions orthogonal to the tangent 
space TxM. However, these large errors are not important when 093(x) is used in constructing a local kernel, since 
the local kernel construction only depends on the projection of D93(x) onto the tangent space. The likely reason that 
the linear regression, YX^(XX^)~^, gives better results than the correlation estimate, ^YX^, is that the errors arising 
from the approximation of the continuous integrals by the finite summations in YX^ and XX^ are correlated, similar 
to the result found in [8]. Finally, we note that the columns in Y require the value of y = 93(x), which is assumed to be 
known in the training data set, but will not be known if we wish to extend the map D93 to a new point v*. However, this 
can easily be overcome by converting from a linear regression to an affine regression, which will implicitly estimate a 
weighted linear regression for 93(x*). 

We demonstrate this method of estimating derivatives by defining the function, 

93(x,y,z) = + z, 

which restricted to the torus can be written in the coordinates (6, 0) as, 

93{x,y, z) = 93(6, (p) = (2 + cos(^))^ cos(0) sin^(0) -i- sin(^). 

We evaluate 93 on the data set lying exactly on the torus example in Section 3.1. We will evaluate the derivative 
D93(x,y,z) = (y^,x,l) by projecting onto the two tangent directions and and the orthogonal direction 

X . In Figure 4, we compare the contour plot of the analytical derivatives (first column) to the corresponding 
estimates obtained by the linear regression (second column) and the covariance matrix (third column). Notice that 
the correlation matrix estimate ^YX~^ ^ D93 is approximately zero when projected in the direction orthogonal to 
the tangent plane, whereas the linear regression estimate YX~^(XX~^)~^ recovers the analytic derivative even in this 
orthogonal direction. We re-emphasize that when used in a local kernel, the behavior in the orthogonal direction is 
irrelevant to the limiting operator. 
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Figure 4: Contour plots of derivatives D^(x, y, z) (^p), D9k(x,y,z)^^^^^ (middle), and D9k(x, y, z) ( x (bottom) are shown 

for the analytical computation (first column), regression estimate (second column) and covariance estimate (third column^ 


3.3. Tuning the Local Bandwidth via SVD 

A significant challenge in applying kernel-based methods such as diffusion maps and local kernels is tuning the 
bandwidth parameter e. The algorithms of 14, 3] are based on a global bandwidth parameter, meaning that the same 
value of 6 is used for all data points. In 15] a method was introduced for tuning the global bandwidth parameter based 

on the scaling law in (8). As pointed out in 15], when e is well chosen, the kernel j will localize the 

integral over the whole manifold onto the tangent plane. This localization is made rigorous up to an error of order-6^/^ 
in Lemma 8 of 14]. Thus, when e is well-tuned we expect to see the scaling law D{x) oc On the other hand, in 
the limit as 6 ^ 0 we find D{x) ^ ^ Yli=\ 0 = 0 and in the limit as 6 ^ oo we find D{x) ^ ^ 1 = 1. When using 

a global bandwidth, the approach advocated in 15] was to average D{x) over the dataset, and to choose bandwidth 
parameter e so that 



Of course, this method of tuning the bandwidth parameter requires knowing the intrinsic dimension d of the manifold 
M. In 15] they advocated choosing e such that log(D(6)) | log(6) -f c is approximately linear as a function of log(6). 
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In [2], an extension of the method of [5] was advocated that simultaneously determines the bandwidth parameter 
6 and the intrinsic dimension d. The approach of [2] is based on the scaling law S (e) defined by, 




dlog(D) 
dlogie) ’ 


and noting that when 6^0 and 6 ^ oo we have S (e) 0. We should note that the limit 5 ( 6 ) ^ 0 as 6 ^ 0 applies 

only to the biased estimate D(xj) where the summation includes i = 7 , meaning that the largest summand is always 1. 
The largest summand being 1 implies that the other summands will lose numerical significance as 6 ^ 0, meaning D 
converges to a constant and S (e) 0. If the unbiased summation of D(xj) were used (for example in a Kernel Density 

Estimation) then as 6 ^ 0 the summand corresponding to the shortest distance would dominate, so that D oc exp(-c/ 6 ) 
and S (e) = oc in the limit as 6 ^ 0. However, in this paper we restrict our attention to the biased 

estimate, as required by the diffusion maps and related algorithms, so that as 6 ^ 0 we have S (e) 0. This implies 

that S (e) has a unique maximum, and in [2] they chose e to maximize S (e) and then set the dimension by, d = 2S (e). 
The approach of [2] was found to be ineffective for kernels with a global bandwidth parameter, especially when there 
are large variations in the sizes of local neighborhoods due to the sampling of the data set. However, the method of [2] 

was found to be very robust for a variable bandwidth kernel of the form exp j where the bandwidth function 

p(x) was chosen to be inversely proportional to a power of the sampling density, namely p(x) oc p(x"f for JS <0. 

From ( 8 ), we should have a scaling law D(x) oc in each local region. We can now connect this fact to the 
scaling laws of the singular values shown above. Recall that X has d singular values equal to ct/ = + 0{e), 

I = 1 ,...,d and the remaining n - d singular values are order- 6 . Thus, we have trace(XX^) = = de + 0{e^) so 

that ^trace(XX^) = d + 0{e). Since the trace is independent of the order of multiplication, we can define v = (26)“^ 
so that d log y = ^ = -^ = -d log e and write. 


-trace(XX^) = 2 ytrace(X^X) = 


2 y 

W) 


^ exp [-v\\xi - xf)\\xi - xf 




Dix) 


- 2 y d 
D(x) dv 


y^exp(-2v||x,-x|p) 


-2y dD{x) ^d\ogD{x) 
D{x) dv d log 6 


The previous equation confirms that the scaling law of D{x), given by. 


^1(6) 


d\og{D{x)) 

d\og{e) 


should be equal to J/ 2 , so one method of estimating the dimension for a given value of e would be. 


Ji(6) = 25i(6), 

and this formula uses the singular values by implicitly taking the trace of the matrix XX^. This formula was previously 
known based on the fact that D{x) oc 6 ^^^, which comes from the normalization factor for a Gaussian on T^M. 
However, the connection to the sum of the singular values reveals that when the ambient space dimension, m, is large, 
the singular values ct/ for / > J can lead to overestimation since. 


fTl 

di{e) = 25 1 ( 6 ) = -trace(XX^) = d+ 

^ i=d+i 

Of course, each ct/ is order- 6 ^ for I > J, however, when m is large enough, this summation can lead to significant 
overestimation. We note that the coefficients of these order- 6 ^ singular values depend on the curvature of the manifold 
at the point x, and these coefficients can be large for complex geometries. This shows how the value of 6 , which 
maximizes the local scaling law S 1 ( 6 ), as suggested in [2], can overestimate the dimension. 

Here, we introduce a new method that combines the ideas of [5, 2] with the local SVD in order to tune e in each 
local region and improve approximation of the tangent space. Recall from Section 3.1, in a local region of x e M, we 
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Figure 5: Singular values as a function of e for a high curvature embedding of a torus into (left) and the same data set perturbed by 30- 

dimensional additive Gaussian noise with mean zero and covariance matrix ^/soxso (right). 


define the matrix of weighted vectors, X, with columns, 

Xi = exp I (Xi - x). 


Letting ct/ be the singular values of X, when e is well tuned the first d singular values obey the scaling law cri oc ^/e 
and the remaining m-J singular values (where m is the ambient space dimension) are higher order, namely ct/ = 0(6). 
Notice that the m - J singular values which are 0(6) are not necessarily proportional to 6; indeed they can be exactly 
zero in the case of a linear manifold such as a plane embedded in M?. One strategy would be to threshold the singular 
values, however, by adding a small amount of noise to the data set in the ambient space, we can easily produce singular 
values which are greater than 6. We illustrate these issues in Figure 5 by embedding a torus into where the first 
three coordinates are the standard embedding of the torus and the remaining 27 coordinates results from applying 
a randomly-generated orthogonal transformation to the first three coordinates raised to the third power and divided 
by 30. Cubing the coordinates results in a high curvature embedding, which leads to large constants on the 0(6) 
bound on the singular values corresponding to singular vectors that are orthogonal to the manifold. The orthogonal 
transformation generates a nontrivial embedding into and the addition of Gaussian noise makes this a highly 
complex embedding of an intrinsically simple data set. In Figure 5 we show the singular values for the clean and 
noisy 30-dimensional embeddings. Notice that thresholding singular values less than 6 may be effective when the data 
lies exactly on the manifold (left), however the high curvature can result in nontrivial constants in the 0(6) bound. 
The addition of noise implies that the dimension of the manifold is greater than two for some values of 6 (for example 
6 ^ 10“^). For 6 G [2 X 10“^, 10“^] the third largest singular value is larger than 6 but does not obey the scaling law 
6^/^. While thresholding alone cannot detect the two-dimensional structure, the scaling laws reveal the true dimension 
of the manifold. 

To incorporate the scaling laws of the singular values into the tuning of 6 and the dimension estimation, we 
introduce the following measure of dimension. 


floor( ) 

d2ie) = 2 Yj 

l=l 


dlog(cri) 

dlog(6) 


2(di - fioor(Ji)) 


Jlog((Tfloor(Ji)+l) 

d\og(6) 


Notice that when d\ is an integer, the second term is zero, and the summation is simply the sum of the first di scaling 
laws. If the first di singular values correspond to tangent vectors, then the associated scaling laws should be 1/2, and 
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in this case, we would find d 2 = 2 1/2 = di. More generally, we can see that the summation can be rewritten as. 


floor(t/i) 

^ i: 

i=i 


dlogjcri) 
d\og(€) ~d\og(€) 


= 2 - 


log 


di 

n 

,/=l 


cri 


which reveals this second dimension to be related to the determinant since it comes from a product of singular values 
(as opposed to di , which comes from a summation of singular values). The final term is included so that J 2 is a smooth 
function of e. 





Figure 6: Top Row: Dimension measures di (blue) and d 2 (red) as functions of the bandwidth e at the base point (1.996,0.126,1.000)^ corre¬ 
sponding to the data set sampled from the torus (left) and the noisy torus (right) shown in Figure 3 of Section 3.1. The metric of agreement, M(€), 
is shown as the dotted black curve. The solid black dot represents the bandwidth that minimizes the metric along with the average dimension at the 
optimal 6. Bottom Row: Same curves for the 30-dimensional high-curvature embedding used in Figure 5 (left) and with 30-dimensional Gaussian 
noise (right) 


For each value of e we now have two estimates the dimension, and when e is well-tuned these two estimates of 
the intrinsic dimension should agree, so we choose e to minimize the relative disagreement where we set the 

intrinsic dimension to be, 

<^ave(^) = {di{e) + d2(€))/2. 

A slight complication is that the curves diie) and d 2 (€) can intersect multiple times, as shown in Figure 6. In order 
to ensure that the scaling laws are stationary at the intersection point, we would also like to minimize the derivatives 
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I I ^dioge \ Thus, as a practical method of choosing 6, we minimize the metric, 


di{e)-d2{e) 

L 

Jlog di 

L 

Jlog ^2 

dm yd) 


dlog 6 


Jlog 6 


where the derivatives are numerically discretized. 

We demonstrate this method of tuning the bandwidth e on the example in Section 3.1 and the results are shown in 
the top row of Figure 6. We also applied this method of tuning the bandwidth to the 30-dimensional high curvature 
embedding from Figure 5, and the results are shown in the bottom row of Figure 6. The optimal bandwidth shown in 
the top row of Figure 6 was used to plot the singular vectors in Figure 3 above. 

We should note that there are many other approaches one could take to estimate the intrinsic dimension of man¬ 
ifolds using the facts introduced in this section. In particular, there are many thresholding methods that could be 
applied to find integer dimensions. Motivated by applications to noisy and fractal data sets (which fall outside of 
the current theory) we have developed a non-integer measure of dimension based on scaling laws. Moreover, notice 
that in Figure 1, different parts of the manifold contract at different rates, so that the dimension of the manifold does 
not appear constant. As a result of this, in Section 4 and in Appendix A we will use the rescaled diffusion map¬ 
ping O of Section 2.1 with a locally determined dimension d{xi). Whichever method is used to estimate dimensions, 
the examples in this section show that both the magnitudes and the scaling laws of the singular values should be 
incorporated. 

A significant drawback of the method of tuning the bandwidth 6, introduced in this section, is that computing d 2 {e) 
requires computing the singular value decomposition of the weighted vectors X, for every base point and a large range 
of bandwidth parameters. Due to the increase in computational complexity, in all of the examples below (and in the 
algorithm of Appendix A) we use the simple method of maximizing di to choose the bandwidth. We suspect that this 
method of choosing the bandwidth is sufficient for the examples below due to low curvature embeddings with small 
noise, so we included this new method of tuning e to demonstrate a robust tuning method for more complex data sets. 

4. Iterated Diffusion Map (IDM) 

In this section we consider representing general maps 7T that can take data in high-dimensional spaces to lower¬ 
dimensional spaces, generalizing the result in [3] that was reviewed in Section 2.2. In particular, we will make use 
of Theorem 2.1 to find an isometric embedding of M with respect to the appropriate geometry such that these new 
embedded coordinates emphasize the feature of interest 9d{M) = TV. In analogy to the diagram in Section 2.2, we 
shall see that the proposed method represents 7T with a linear map between the iterated diffusion mapping of the data 
manifold M and the rescaled diffusion coordinates of feature space TV. 

One of the challenges is that the result in [3] is not immediately applicable since 7T is not assumed to be a 
diffeomorphism, and therefore the kernel constructed in (5) from is not necessarily a local kernel. To see this, 
we can define a covariance matrix C{x)~^ = Di^(xy D9^(x), where Dik{x) is the local derivative in the ambient 
space estimated by linear regression as discussed in Section 3.2. If we naively form the kernel K{e, x, y) from (3) with 
covariance matrix C(x), then this will not be a local kernel. The problem is that the restriction of C(x)~^ to the tangent 
plane, c(x)~^ = I(x)C(x)~^I(xy = D'Hixy D'Hix), may not be full rank since the map 7T may take the manifold At 
to a lower-dimensional manifold 9d{M). If c(x)~^ is not full rank, then there exists a nontrivial vector u e T^M such 
that c{x)~^u = 0 (in fact c{x)~^u = 0), so if y - v = (w, q{u)) we find K{e, x,y) = 0(1), which means that K does not 
have the exponential decay, so K is not a local kernel (see Section 2.2 and [3]). 

Often the kernel K is constructed using the k nearest neighbors, so that K(€, v,y) = 0 by definition when y is not 
in the list of the k nearest neighbors of x, and vice-versa. When the k nearest neighbor algorithm is used, technically 
the kernel K constructed with a rank deficient covariance matrix is still a local kernel since the kernel still has an 
implicit decay that can be bounded above by an exponential function. However, the localization caused by the k 
nearest neighbor algorithm has a very sharp cutoff such that the corresponding operator approximated by the kernel is 
very sensitive to the choice of k. 

In order to use the local kernels theory to represent the feature map 7T, we propose a novel algorithm called 
the iterated diffusion map (IDM). The IDM will make use of local kernels which use small perturbations of identity 
covariance matrices such that Theorem 2.1 is applicable on each iteration. In Section 4.1, we present the IDM and 
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show that it is a discrete approximation of an intrinsic geometric flow. In Section 4.2, we show that if the data 
space At is a product of the feature space and the irrelevant space, then IDM will produce a quotient manifold that 
is isometric to the feature space, eliminating the irrelevant dimension. Finally, we will show numerical results with 
IDM in Section 4.3, highlighting its advantages and limitations. The numerical algorithm of the IDM is outlined in 
Appendix A. 


4.1. IDM as an Intrinsic Geometric Flow 

We now introduce the IDM algorithm for feature identiflcation. The method assumes the availability of a pair of 
data sets x/ e At c and yt = Fl(xi) e N c M", where 7T is not assumed to be a diffeomorphism and N may 
even be lower dimension than At. With this training data, we apply the linear regression method in Section 3.2 to 
approximate the local derivative D94 in the ambient space which is subsequently used to deflne a new covariance, 

C^mix) = ((1 - T)I„xm + tD<H{xVD iHix)]"', (14) 

where \mxm is the m x m identity matrix. Notice that with this construction, C^(o)(x) is guaranteed to be positive 
deflnite, even when DFl^xYDFl{x) is not a full rank matrix (where the relation of DFl and 044 is deflned in (6)). 
With the deflnition in (14), we implicitly deflne a map Q \ M ^ M such that DQ{xyDQ{x) = C<}i{o){x)~^. When 
T 1, intuitively, ^ is a small perturbation of an identity map on At since 

DQix) = Imxm - ]^T{piH{xyUHix) - I„xm) + 0{t^). 

Unlike the sharp decay due to the k nearest neighbor cutoff, (v) achieves a smooth decay even in directions where 
D4-((x)D4-{(xy is rank deficient. Using the prototypical kernel, 

K(€, x,y) = exp (-(y - x)^C<H(o)(x)“^(y - x)/2 ^, 

along with the construction in Theorem 2.1, we approximate the operator which is the Laplace-Beltrami operator 
with respect to the Riemannian metric, 

= ((1 - T)W + - t)w + D^^iyHy\ (15) 

where c^(o){x) = J(x)C.^(o)(x)J(x)^. Notice that if we build a diffusion map = (e^^'(fi(x), ...,e^^‘^(fM(x)y = 

x^^^ using the eigenfunctions of (approximated by the local kernel construction) this gives an approximately 
isometric embedding of At with respect to the metric ^<^(0). for small enough parameter 5'. Moreover, the new metric 
g^(o) in (15) puts a larger weight on directions in which 044 are large, which are the direction associated with the 
range space of 44. 

The key point that makes the iterated diffusion map useful is that the local kernels with covariance deflned below 
(cf. (18)), change the geometry, as opposed to iterating the diffusion maps using identity covariance, C(x) = Imxm^ as 
discussed in Section 2.1. In particular, the Ath iteration is performed on the coordinate 

/e-i) = { = 2,3,..., (16) 

where s x, with induced feature maps M defined as follows, 

= { = 2,3,.... (17) 


Numerically, we approximate the local derivative ^ of in the ambient space by the linear regression 

method in Section 3.2. In this particular implementation, and in Theorem 3.1 are defined as. 


= D(x(^“‘>)“‘/^ exp 


exp 


11 ^' 


(^-1) _ 


4e 

4e 


ix- 


.(e-i) _ j{-iK 


(yj-y). 
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where := (Of o of o... o (^f^)(xj) for € >2. Given we define local kernels induced by covariance 

matrices, 

= ((1 -t)Imxm + , { = 2,3,.... (18) 

We can now repeat the local kernel construction above using the covariance to produce eigenfunctions 

and eigenvalues of A c^-d and obtain x^^^ = For 5' sufficiently small, the new coordinates will be an 

approximately isometric embedding of At with respect to the metric, 

_ ^-1/2 ^ -1/2 _ -112 -1/2^ -1/2 -1/2 .-lox 

~ ^(j^i{-\) ~ ^<}-[{{-V) ’ ’ ’ §M (1^) 

where c^(;)(x) = I{x)C<ji{j){x)I{xY for 7 = 0,...,^-!. Each iteration of the diffusion map further emphasizes the 
directions on the manifold At, which are important to the function *K. Moreover, the map *K(x) is a fixed point of the 
iterated diffusion map process. To see this, assume that for some k we have = 1~l{x), then we find 

^^^\^{x)) = = ^(x), 

so which implies that = Idxd such that c<j^{k) = l^xd so g^iik+D = gq^{k) and therefore = 

^{k) _ {}{(^x). Similarly, any isometric embedding L;^{9i{x)) is a fixed point of the iterated diffusion map. To see this, 
assume x^^^ = iM{^{x)) and note that = "Hix) so that = D^. Since is an isometric 

embedding, we have Dlj^ = I and this implies that = /. It remains an open question whether this fixed point is 

attracting in the general case, and further analysis is needed to understand this issue. 

One possible interpretation of the IDM is as a discretization of a geometric flow. To see this, we define, 

C(x, f + T) = ((1 - + TiyH{x{t)yUHixitY > 

as a continuous analog of (18), where x{t) is an isometric embedding of (At, g{t)) where g(0) = gM and with a feature 
map defined continuously 7T(x(0) = 7T(x(0)), where x(0) = x to mimic the discrete setting in (17). The new metric 
introduced by c(x, t + t) would be, 

g{t + r) = c(x, t + g{t)c{x, t + 

= (I + t(D^^D 9{ - l)Y^g(t)(l + t(D^^D9{ - I))^/^ 

= g(t) + 2(^D^^D^g(t) + g{t)UH^UH - 2g{t)) + 0{d). (20) 

Rewriting the previous equation we find, 

^ = lim ~ sd) ^ Uim^img + gim^im), (2i) 

at T^o T 2 ^ ^ 

which is an equation describing an intrinsic geometric flow. Notice again that if = /, then g is an equilibrium 
solution of (21). We should note that the geometric flow in (21) is nonlinear since the map 7T depends on the metric g 
in nontrivial fashion (since 7T maps an isometric embedding of (Al, g(t)) to the feature of interest in 7T(A1)). This is 
the reason why it is not straightforward to see whether there are other equilibrium solutions or even to determine the 
stability of any equilibrium solution. The standard linear stability analysis suggests that if g* is the fixed point of (21), 
then g* is locally attracting when the real part of all of the eigenvalues of the linearized operator Dg(D^^D^g + 

gD9 {^is less than 2. 

4.2. IDM for Product Manifolds 

We now consider a simple case where the manifold Al is a product space M = N xP, such that N is the feature 
space and P contains variables we wish to ignore. In this case, the map 44 : M ^ M has a particularly simple 
structure. In each local neighborhood we can find coordinates x = (y, z) e Al where y are coordinates on TV and z 
are coordinates on P. In these coordinates the metric g will naturally decompose into a block diagonal matrix. The 
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first block represents the metric on N and this block is dj^ x and the second block represents the metric on 
!P and this block is dp x dp. Since "H^M) = N maps each point to the feature of interest, in these local coordinates, 
the feature map will take the form 'H(x) = "Hiy^z) = y. Moreover, in these coordinates, D^(x) is a block diagonal 
matrix where the first dj^ x dj^ submatrix is the identity matrix and the remaining entries are all zero. So we find 
that - / is again a block diagonal matrix, where the bottom dp x dp block is equal to -/ and the remaining 

entries are zero. Writing the geometric fiow (21) in these coordinates we find. 



which implies that = 0 and gp = -gp. This shows that for product manifolds of the geometric fiow (21) will 
contract the irrelevant variables to zero and leave the features of interest unchanged. So for sufficiently small dis¬ 
cretization T and in the limit of sufficiently many iterations, the IDM will construct the quotient map from the product 
manifold to an isometric copy of the feature space. At this point, the data can easily be mapped to the feature space 
using the method of Section 2.2. In fact, since the quotient manifold is already isometric to the feature space, one 
could simply estimate a linear map between the rescaled diffusion coordinates of the quotient manifold and those of 
the feature space (since these coordinates are canonical up to rotation as shown in Section 2.1). In analogy to the 
diagram in Section 2.2 which represents a diffeomorphism, we can summarize the IDM construction of the quotient 
map with the following diagram. 


M = NxP 


'K 

- > 


N 


'P=lim^^oo,.^o of o-oOf 


L\N, g) M" —-—» L^{N, gjv) ~ R" 

where 'T represents the iterated diffusion map, O are the rescaled diffusion coordinates of TV. The above diagram 
shows how 9-f is represented by an orthogonal linear transformation HVm9d = o ^ o 'F. 

Moreover, consider the case when M = N xP, but the feature of interest is ‘F(TV), where is a diffeomorphism. 
In this case, the block diagonal structure DFf and of (22) will still hold, and in particular we still find gp = -gp. 
This shows that the fiow still contracts the irrelevant variables P to zero, and the only difference is that we will find 
gN = ^ {{DT^DT - I)gM + DT - /)). Notice that the fixed point for this fiow satisfies DF = /, so we 

expect in the limit to obtain an isometric copy of TV. However, even if the fiow on gj^ has not converged, once the 
IDM has contracted the irrelevant variables P, we can use the construction in Theorem 2.1 to represent the final 
diffeomorphism between 'F(AI) and the feature space F(TV). In the next section we demonstrate the IDM on two 
product manifolds, namely the annulus and the torus. We will also attempt to apply the IDM to manifolds which are 
not product manifolds and report the empirical results. 

4.3. Examples 

In this section we will demonstrate how the iterated diffusion map is able to contract a manifold onto a lower¬ 
dimensional feature of interest. All the examples use M = 25Q rescaled diffusion coordinates. We found the results 
to be robust down to around Tkf = 100 rescaled diffusion coordinates and no improvement above M = 250. In the 
examples below we adjusted the parameter r e (0,1), which defines the discretization of the geometric fiow in Section 
4.1, in order to achieve the desired feature in about four iterations of the diffusion map. In principle, one would like 
to take T as small a possible, however this requires many iterations that are computationally intensive. Also, we have 
found that numerical errors can accumulate over large numbers of iterations, which we discuss in the Section 5. For a 
compact description of the numerical algorithm, see Appendix A. 

The first example is the annulus described in Section 1 which used r = 0.65. The annulus is a product space 
A = 5 ^ X [1,3] and in Figure 1 we show the iterated diffusion map recovering the radial component. If we parameterize 
the annulus with polar coordinates {6, r) g [0, In) x [1,3], then the feature of interest in Figure 1 was the coordinate 
r, which shows that the iterated diffusion map is able to change the topology of a manifold (both the dimension and 
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Figure 7: Top: Original annulus data set colored according to the feature of interest (leftmost), followed by four iterations of the diffusion map 
using the local kernel defined in Section 4 with r = 0.3. Each diffusion map shows the first three rescaled diffusion coordinates colored according 
to the feature of interest (the angle of the data point in the original annulus). However, 250 rescaled diffusion coordinates are maintained at each 
step. Bottom: Original data set showing the 200 nearest neighbors (red) of the blue data point, where the neighbors are found in the corresponding 
iterated diffusion map embedding. 


the number of holes are changed in Figure 1). Notice that although both the source and target manifolds are less than 
three dimensional, the iterated diffusion map must move through a three-dimensional embedding in order to transition 
between these very different geometries. Indeed, the first application of the diffusion map (with the local kernel 
described in Section 4.1) shown in Figure 1 transforms the geometry from an annulus to a cylinder. Intuitively the 
cylinder introduces a new variable, height, to represent the feature of interest. This is shown by the coloring in Figure 
1 which represents the radius of each point on the original annulus, and varies only with the height of the cylinder. As 
the diffusion map is iterated, the geometry evolves as described in Section 4, intuitively putting more emphasis on the 
direction (namely the height) which contains the radial information. This is manifested as the circle component of the 
cylinder contracting until the data set becomes a line, thereby representing only the feature of interest as shown by the 
coloring. 

We now show that the IDM can also contract the annulus onto the other natural feature of interest, namely the 
angle. However, note that the single parameter 6 e [0, 27r) is not an embedding of the circle, since the periodic 
boundary conditions cannot be satisfied in Instead, to recover the circle from the annulus, the feature of interest 
is the two-dimensional feature (sin(^), cos(^))^, which is an embedding of the circle. In Figure 7 we show the results 
of applying the iterated diffusion map to the annulus with the feature 7T(^, r) = (sin(6), cos(6)y. 

Next we consider a simple example where the manifold is a torus T^ = S^xS^ with intrinsic coordinates (6, (p) e 
[0, In)^ with periodic boundary conditions. Since the torus is a product of two circles, parameterized by 6 and 0, 
respectively, we can consider either of these circles as a lower dimension feature of interest. For example, when the 
desired feature is the circle parameterized by 6, the feature valued function is 7T(v, y, z) = (sin(6), cos(6)y. As shown 
in Figure 8, when the feature of interest on the torus is either of the circles in the product structure, the IDM evolves 
the manifold by contracting the irrelevant circle until only the feature of interest remains. Notice that the IDM is able 
to destroy topological features such as holes in pursuit of the feature of interest. 

Finally, we consider a manifold that is not a product space, namely a 2-dimensional unit sphere, and we first 
choose the feature of interest to be simply the v-coordinate of the sphere and the results for this example are shown 
in Figure 9. We also consider a sphere with a more complex feature that twists up the sphere, as shown in the third 
row of Figure 9. While in these cases the geometric flow cannot be described by the simple product formula in (22), 
the flow still emphasizes the feature of interest and seems to contract the manifold onto a lower dimensional manifold 
that better represents the feature. 
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Figure 8: Top Row: Original data set colored according to the desired feature (leftmost) followed by four iterations of the IDM with the feature 
of interest given by z) = (sin(0), cos(0))^ and r = 0.4. Second Row: Original data set showing the 200 nearest neighbors (red) of the blue 

data point, where the neighbors are found in the corresponding iterated diffusion map space from the top row. Third Row: Original data set colored 
according to the desired feature (leftmost) followed by four iterations of the IDM with the feature of interest given by ‘H(x,y,z) = (sin(0), cos(0))^ 
and T = 0.65. Bottom Row: Original data set showing the 200 nearest neighbors (red) of the blue data point, where the neighbors are found in the 
corresponding iterated diffusion map space from the third row. 


5. Conclusion 

The above results show that for intrinsically low-dimensional data sets, an iterated diffusion map can help identify 
features that are hidden in the geometric structure of the data. From a geometric point of view, the IDM approximates 
a geometric flow that stretches directions on the manifold that are locally correlated to the desired feature and contracts 
directions that are locally uncorrelated with the desired feature. When the data manifold is a product of the feature 
manifold and the irrelevant variables, the geometric flow (21) reduces to (22), which stably contracts the irrelevant 
variables to zero. So for product manifolds, the IDM constructs the quotient map from the product manifold to 
the feature manifold. For more general manifolds, this geometric flow appears empirically to converge to a lower¬ 
dimensional manifold that better represents the feature of interest. 

Several key tools are necessary for the construction of the IDM. First, as shown in Section 3, one needs to be 
able to estimate the local derivatives of a nonlinear map between manifolds embedded in Euclidean space using only 
empirical data. Second, the construction of Section 4 is necessary to form a local kernel that satisfies the requirements 
of Theorem 2.1. Finally, the rescaled diffusion mapping of Section 2 is needed to give an isometric embedding of the 
new geometry introduced by the local kernel. With these three pieces in place, it is possible to iterate the diffusion 
map in a way that approximates a geometric flow and emphasizes the variable of interest. 

However, several important problems remain unsolved. First, we found empirically that applying too many itera- 
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Figure 9: Top: Original data set colored according to the desired feature (leftmost), followed by four iterations of the diffusion maps is the local 
kernel defined in Section 4 with r = 0.7. Second Row: Original data set showing the 400 nearest neighbors (red) of the blue data point, where the 
neighbors are found in the corresponding iterated diffusion map space. Third Row: Original data set colored according to a more complex desired 
feature (leftmost), followed by four iterations of the IDM with the feature of interest given by ‘H(x,y,z) = sin(7rz/2 + tan“^(y/x)) and r = 0.6. 
Bottom Row: Original data set showing the 400 nearest neighbors (red) of the blue data point, where the neighbors are found in the corresponding 
iterated diffusion map space from the third row. 


tions of the diffusion map lead to apparent numerical instability. We suspect that this problem arises from accumulated 
numerical error in the repeated eigensolves required to find the diffusion mappings. Second, the exact criterion for 
the convergence of the iterated diffusion map to the desired feature requires a better theoretical understanding of the 
geometric flow of Section 4, such as its attracting set and the stability of the equilibrium points. Finally, the current 
algorithm requires the entire manifold to be well-sampled, which means that the data requirements depend on the 
dimension of the data set and not simply the feature set. Intuitively, it may be possible to extend the IDM to points 
that lie in sparsely-sampled regions of the data set, as long as these regions are well-sampled in the feature space. 
Ideally, this could reduce the data requirements to only depend on the dimensionality of the very low-dimensional 
feature set. However, it is unclear how to extrapolate the IDM into these sparsely-sampled regions of data space, and 
currently the need to estimate the local derivatives requires fairly dense sampling everywhere on the data manifold. 
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Appendix A. Numerical Algorithm for the Iterated Diffusion Map 


Given a data set c At c M'” and a feature [ji = c TV c M", the algorithm of this section can be 

used to construct an embedding which emphasizes the feature of interest. We assume that xi are points in 
which are sampled on (or very close to) a J-dimensional manifold M, and that the feature of interest yi = 9i{xi) lives 
on a manifold TV which has dimension less than or equal to d. The algorithm also produces a basis of eigenfunctions 
[ij/j{xi)}^^^ that can be used to represent the map 7T and extend this mapping to out-of-sample data points v with 
standard methods such as the Nystrom extension. 

The first part of the IDM is a generic algorithm for estimating the derivative D^{xi) at each point. The step-by- 
step algorithm is outlined in the first box below. Optionally, if the embedding of M has high curvature or the data is 
noisy, a more robust method of tuning the local bandwidth parameter may be used by adding the following substeps 
to Step 4. and then replacing Step 5. with “Set ^ = argmin(Tkf) and d(i) = Jave(^)-” 


(e) 

(f) 

(g) 

(h) 

(i) 
(i) 


Form the ^ X m matrix X of weighted vectors with j-th column Xj = y ^(v/(y) - Xi) 

Compute the singular values (Ti, ..., of X and store them in s(^, j) = ctj for j = 1,..., m 
If ^ > 1 Compute the scaling law of each singular value a(^ - I, j) = \ 

lff>l Set Jo = floor(Ji(^ - 1)) and compute J 2 (^ - 1) = 2 - 1, 7 ) + 2(Ji - do)a(£ - 1, Jo + 1) 

If ^ > 1 set Jave(^ - 1) = (di(£ - 1) -f d 2 (£ - l))/2 

Tf / ^ 1 QPt Mf/ - 1 J - I di(l-l)-d2(l-l) I , I log((ii(f))-log((ii(^-l)) I I log((i2(f))-log(<J2(^-l)) I 
11 t ^ i i) - I \ ^ \ log(e(U)-log(e(^-l)) | | log(e(U)-log(e(f-l)) | 
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Algorithm 1: Estimating iyH{Xi) with Auto-tuned Bandwidth 

Inputs: Data sets c R"* and c R". Parameters: number of nearest neighbors, k, and number of discrete values 

of the bandwidth to consider, L. 

Outputs: At each point Xt the algorithm returns the local estimates of the derivative D9^(xi) and the intrinsic dimension d(i) 
and the sampling density q(i) = q(xi). 

For each / = 1,A 


1 . 

2. 

3. 

4. 


5. 

6. 

7. 

8 . 
9. 

10 . 

11 . 

12 . 


Find the /:-nearest neighbors of Xf in R"*, let their indices be I(j) (where 7(1) = /) for j ordered by increasing 

distance and let d(j) = \\xi - A/(y)||. In Section 4.3 we used k - 500. 


Tune the local bandwidth e using steps (c)-() 

Define Emin = d(2)l(2\og Emach) and = I0d(k) 

For 7= 1,...,L 

(a) Let 6(7) = exp (log(6min) + (7/L)(log(6inax) - log(6min))) 

(b) Compute the weights wj - exp (^-d(j)^/{2€{£))^ 

(c) Compute the sum 79(7) = Y!j=i 

(d) If 7 > 1, compute di{£ - 1) = 2- 
Set 7 = argmax(<ii) and d(i) 

Set 6 = 6(7 + 1) 

Compute the weights wj = exp [-d{j)^ l{2e)^ 

Compute the sum D 


> log(Z)ffl)-log(Z)(^-l)) 
' log(6(^))-log(6(^-l)) 

^ dm 




D 


Set q{i) 

Form the kxm matrix X of weighted vectors with j-th column Xj 



Form the kxn matrix Y of weighted vectors with j-th column Yj 


Compute the m X fz matrix D9^(i) using the linear least squares regression - (X^X)"^X^F 


To compute the Iterated Diffusion Map (IDM), we will iteratively construct where x\ = Xi and 7 runs up to the 
desired number of iterations, t. Determining a good stopping criterion is a difficult problem. If the goal is to estimate 
the map using the method of Section 4.2, then a promising approach is to use cross-validation. This approach 
would first compute the rescaled diffusion coordinates of the feature ji and then attempt a linear regression from x^. to 
these diffusion coordinates and iterate until the residual ceases to decrease. Another significant issue is that the theory 
of [3] has not yet been extended to use the variable bandwidth kernels of [2]. So even though we have an estimate of 
the optimal bandwidth at each point, we can only apply Theorem 3.1 with a fixed global bandwidth. In our examples 
we found the best choice of global bandwidth was a simple average of the squared distances to the k = 32 nearest 
neighbors of each point, averaged over the whole data set. One reason this ad hoc bandwidth is required is due to 
the large variations in the dimension that occur as the diffusion map iterates, see for example Figure I where some 
parts of the annulus contract to a line before others. These variations of the dimension also require us to use a locally 
rescaled diffusion mapping. Notice that Step 8 (a)-(d) are the standard diffusion map algorithm using the local kernel 
defined by C<k and the associated distances However, to normalize the eigenfunctions in Step 9, we use the locally 
estimated sampling density q{i). Also, to form the rescaled diffusion mapping in Step 11, we use the locally estimated 
dimension d(i). We found that when a global kernel density estimate and a globally estimated dimension were used, 
the distances were scaled very differently in different parts of the data set and this distortion led to numerical problems 
after several iterations. 
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Algorithm 2: The Iterated Ditfusion Map (IDM) 

Inputs: Data sets c E” and c E". Parameters: number of nearest neighbors, k, number of discrete values of the 

bandwidth to consider, L, number of nearest neighbors to use to estimate the global bandwidth, k 2 , number of eigenfunctions 
to use in the diffusion map, M, geometric flow discretization parameter, r, and number of iterations, t. 

Outputs: The M-dimensional IDM embedding x\ = o • • • o which represent the feature yt. 

Set xj = Xi 

For each ^ = 1,..., t 

1. Use Algorithm 1, with inputs {x^] and [yt] to estimate D9i{x^.), the local dimension d(i) = d{x^^) and density q(i) = 

2. Let /(/, j) be the index of the j-ih nearest neighbor of xf and let d(i, j) - - ^\\\ 

3. Define the distance d<j-({i, j) = (1 - T)d{i, j) + T\\DM{xi){x^j^. - x^)\\ with respect to C^{xi) from (14) 

4. Use the ad hoc global bandwidth estimate e = j) 

5. Build the local kernel /(/, j) - exp [-d^{i, jf 12^ 

6. Build a sparse N xN matrix J with = J(i, j) 

1. Symmetrize J - {J + /^)/2 

8. Apply the standard diffusion maps normalizations as in [4, 3] 

(a) Right Normalization: Set Di - Jij and K = Jijl(DiDj) 

(b) Left Normalization: Set A = (liyAv) andTT = Kij/{DiDj) 

(c) Compute the M + 1 largest eigenvalues and associated eigenvectors ipr of K for r = 0,M 

(d) Define Prix^.) = (pi{x^.)IDi 

9. Normalize the eigenvectors with respect to the sampling density pr = ^r! -yjyj Yf=i ^rix^iYlqii) 

10. Set 5 = 106 and Ar = \og{g)/e 

11. Define the rescaled diffusion mapping x^^^ - ^ 
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